final_project.Rmd.
The PSID (Panel Study of Income Dynamics) began in 1968, studying over 18,000 individuals living in 5,000 families. In 1997, the PSID launched the CDS (Child Development Supplement) I-III studying the children of these families. The TAS (Transition into Adulthood Supplement) began in 2005 which collected data from these very children transitioning into young adults. The following data is a custom dataset composed of the CDSII (children interviewed in 2001) and follows these children in the TAS 2011.
Want to know (i.e. dependent variable) “change” = 2 (desirable move/state) vs 0 (steady but not ideal move/state) & 1 (undesirable move/state)
The questions I’m looking to answer are
These are the methods of approaches I have considered to analyze the data
The following dataset I am analyzing is a filtered version of a custom dataset Dr. Ashley Palmer compiled from a study done by Dr. Corey Keyes in 2006 using the CDSII data.
Some features were removed from Dr. Palmer’s custom dataset because they were derived from the original CDSII and TAS2011 questions that are used to create the “change” features - the features that I am studying.
In the effort to factorialize “mhstatus” and “change” I had to either remove the missing values, or account for them. Since there is no way to determine whether a missing value indicates a languishing status (perhaps due to the incomplete nature of the survey) or an incomplete status, I chose to remove them as my main focus is what features aid in positive changes. As such, by removing 30% or 835 participant surveys, the data is heavily skewed toward those experiencing positive changes.
Upon removing the missing data, I have removed the data being skewed toward stable but unideal states and states of decline. For this reason I did not need to do any sort of miniority boosting or downsampling the majority sample via SMOTE.
Some data had some NA and I will determine the appropriateness of imputing the missing values. Simply removing all observations with NA as a feature leaves us with no data to analyze. So the question is whether imputation or substituting NA’s with a dummy value is appropriate.
I also created a separate dataset containing only objective data, as through my modeling, I noticed a lot of the subjective indicators were strongly influencing the outcomes. I wanted to see what was contributing to these subjective indicators, so I created a dataset of only “objective” features.
Many features are nominal, making it inappropriate to scale the features. FIX THIS……..With roughly 300 features that are similar in nature…….FIX THIS
Header File: Defining Datasets
#load and convert Stata data into R dataframe frames into environment
dim(df_full)## [1] 1966 267
dim(df_objective)## [1] 1966 207
convertStatDataType <- function(strType) {
mapping <- c("%8.0g"="byte", "%8.0g"="int", "%12.0g"="long", "%9.0g"="float", "%10.0g"="double", "%#s"="str#", "%9s"="strL")
mapping[strType]
}
convertDataframeDisplay <- function(df_convert) {
dfDisplay <- data.frame(matrix(ncol = 0, nrow = ncol(df_convert)))
vectName <- vector(length = ncol(df_convert))
vectLabel <- vector(length = ncol(df_convert))
vectFormat <- vector(length = ncol(df_convert))
vectDescriptions <- vector(length = ncol(df_convert))
for(i in 1:ncol(df_convert)) {
vectName[i] <- names(df_convert[, i])
if(is.null(attributes(df_convert[[i]])$label) == F)
vectLabel[i] <- attributes(df_convert[[i]])$label
if(is.null(attributes(df_convert[[i]])$format.stata) == F)
vectFormat[i] <- lapply(attributes(df_convert[[i]])$format.stata, convertStatDataType)
if(is.null(attributes(df_convert[[i]])$class) == F && attributes(df_convert[[i]])$class == "factor") {
vectFormat[i] <- "factor"
vectDescriptions[i] <- str_flatten(attributes(df_convert[[i]])$levels, "<br />")
}
if(is.null(attributes(df_convert[[i]])$class) == F && attributes(df_convert[[i]])$class != "factor" && attributes(df_convert[[i]])$class[3] == "double")
vectFormat[i] <- "double"
if(is.null(attributes(df_convert[[i]])$labels) == F && length(attributes(df_convert[[i]])$labels) > 0) {
vectLabels <- list(length(attributes(df_convert[[i]])$labels))
for(j in 1:length(attributes(df_convert[[i]])$labels)) {
vectLabels[j] <- as.character(attributes(df_convert[[i]])$labels[j])
if(is.null(names(attributes(df_convert[[i]])$labels[j])) == F && as.character(names(attributes(df_convert[[i]])$labels[j])) != "Actual number")
vectLabels[j] <- paste(c(vectLabels[j], ":", as.character(names(attributes(df_convert[[i]])$labels[j]))), collapse = " ")
else if(as.character(names(attributes(df_convert[[i]])$labels[j])) == "Actual number")
vectLabels[j] <- paste0(vectLabels[j], ":::")
}
vectDescriptions[i] <- str_flatten(vectLabels, "<br />")
vectDescriptions[i] <- str_replace_all(vectDescriptions[i], ":::<br />", ", ")
}
}
dfDisplay$ColumnName <- vectName
dfDisplay$Label <- vectLabel
dfDisplay$Format<- vectFormat
dfDisplay$Labels <- vectDescriptions
return(dfDisplay)
}df_display <- convertDataframeDisplay(df_full_display)
datatable(df_display, options = list(
pageLength=10,
lengthMenu=c(10,50,100,150,250,300)
),
escape=F
)df_objective_display <- convertDataframeDisplay(df_objective_display)
datatable(df_objective_display, options = list(
pageLength=10,
lengthMenu=c(10,50,100,150,250,300)
),
escape=F
)plot_ly(x = ~mhstatus, type="histogram") %>%
layout(title = "Distribution of Mental Health Status", xaxis = list(title = "Mental Health Status"), bargap=0.1,
legend = list(orientation = 'h', title = list(text = "<b>Mental Health</b>")))prop.table(table(mhstatus_train))## mhstatus_train
## Languishing to Languishing Languishing to Moderate
## 0.002544529 0.020356234
## Languishing to Flourishing Moderate to Languishing
## 0.015903308 0.016539440
## Moderate to Moderate Moderate to Flourishing
## 0.364503817 0.330788804
## Flourishing to Languishing Flourishing to Flourishing
## 0.002544529 0.246819338
## Missing
## 0.000000000
prop.table(table(mhstatus_test))## mhstatus_test
## Languishing to Languishing Languishing to Moderate
## 0.005076142 0.025380711
## Languishing to Flourishing Moderate to Languishing
## 0.017766497 0.020304569
## Moderate to Moderate Moderate to Flourishing
## 0.390862944 0.329949239
## Flourishing to Languishing Flourishing to Flourishing
## 0.000000000 0.210659898
## Missing
## 0.000000000
prop.table(table(df_full_train_scaled$PositiveChange))##
## -1.06601041843707 0.937599985623426
## 0.4923858 0.5076142
prop.table(table(df_full_test_scaled$PositiveChange))##
## -1.06601041843707 0.937599985623426
## 0.4923858 0.5076142
We will compare the mean and median for data that has been imputed vs data that have not been imputed. While the patterns appear similar, the y-axis for both the means and medians are vastly different. To preserve the integrity of the data, I will not utilize imputation. I tried doing a violin plot, however it appears there is little variation between among the data.
library(vioplot)
plot_violin <- function(df, df_impute) {
filter <- c("yaearnings_10", "Q23L30B")
df <- df[ , !names(df) %in% filter]
df_impute <- df_impute[ , !names(df_impute) %in% filter]
df_unimputed_long <- gather(df, feature, measurement)
feature_unimputed <- df_unimputed_long$feature
feature_label_unimputed <- unlist(lapply(feature_unimputed, fieldname_to_description))
measurement_unimputed <- df_unimputed_long$measurement
# measurement_unimputed <- unlist(lapply(df_unimputed_long$measurement, function(x) { if(x == "-Inf") x = 0; return(x)}))
df_unimputed_plotly <- data.frame(Feature = feature_unimputed, FeatureLabel = feature_label_unimputed, Measurement = measurement_unimputed)
df_imputed_long <- gather(df, feature, measurement)
feature_imputed <- df_imputed_long$feature
feature_label_imputed <- unlist(lapply(feature_imputed, fieldname_to_description))
measurement_imputed <- df_imputed_long$measurement
# measurement_imputed <- unlist(lapply(df_imputed_long$measurement, function(x) { if(x == "-Inf") x = 0; return(x)}))
df_imputed_plotly <- data.frame(Feature = feature_imputed, FeatureLabel = feature_label_imputed, Measurement = measurement_imputed)
p <- plot_ly(type="violin")
p <- p %>% add_trace(data = df_unimputed_plotly, x = ~FeatureLabel, y = ~Measurement, legendgroup = "unimputed", scalegroup = "unimputed", name="unimputed", side = "negative", box = list(visible = T), meanline = list(visible = T), color = "red")
p <- p %>% add_trace(data = df_imputed_plotly, x = ~FeatureLabel, y = ~Measurement, legendgroup = "imputed", scalegroup = "imputed", name="imputed", side = "postive", box = list(visible = T), meanline = list(visible = T), color = "Green")
p <- p %>% layout(
xaxis = list(title = "Fieldnames"), yaxis = list(title = "Distribution", zeroline = F)
)
p
return(p)
}
plot_summary <- function(df, string_type) {
sum_mat <- summary(df)
vec_medians <- sum_mat[3, ]
vec_means <- sum_mat[4, ]
vec_fieldnames <- trimws(colnames(sum_mat))
for(i in 1:length(vec_medians)) {
vec_medians[i] <- as.numeric(str_replace(vec_medians[i], "Median :", ""))
vec_means[i] <- as.numeric(str_replace(vec_means[i], "Mean :", ""))
}
df_new <- data.frame(
medians = vec_medians,
means = vec_means,
fieldnames = vec_fieldnames
)
p <- plot_ly()
p <- p %>% add_trace(data = df_new, x = ~fieldnames, y = ~means, type = "bar", name="Means")
p <- p %>% layout(title=paste0("Distribution of Means of Summary for ", string_type), yaxis = list(title = "Count", xaxis = list(title = "Field Names")))
q <- plot_ly()
q <- q %>% add_trace(data = df_new, x = ~fieldnames, y = ~medians, type = "bar", name="Medians")
q <- q %>% layout(title=paste0("Distribution of Medians of Summary for ", string_type), yaxis = list(title = "Count", xaxis = list(title = "Field Names")))
return(list(p, q))
}
set.seed(1234)
df_full_impute_forest <- as.data.frame(missForest::missForest(as.matrix(df_full_na), maxiter=1)$ximp)## missForest iteration 1 in progress...done!
plot_violin(df_full_na, df_full_impute_forest)